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ABSTRACT 


In the absence of a central naming authority on the Semantic Web, it is common for different data sets to 
refer to the same thing by different names. Whenever multiple names are used to denote the same thing, 
owl:sameAs statements are needed in order to link the data and foster reuse. Studies that date back as far as 
2009, observed that the owl:sameAs property is sometimes used incorrectly. In our previous work, we 
presented an identity graph containing over 500 million explicit and 35 billion implied owl:sameAs 
statements, and presented a scalable approach for automatically calculating an error degree for each identity 
statement. In this paper, we generate subgraphs of the overall identity graph that correspond to certain error 
degrees. We show that even though the Semantic Web contains many erroneous owl:sameAs statements, it 
is still possible to use Semantic Web data while at the same time minimizing the adverse effects of misusing 
owl:sameAs. 


* Corresponding author: Joe Raad (E-mail: j.raad@vu.nl; ORCID: 0000-0002-7891-7738). 
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1. INTRODUCTION 


As the Web of Data continues to grow, more and more large data sets — covering a wide range of topics 
and domains — are being added to the Linked Open Data (LOD) Cloud. It is inevitable that different data 
sets, most of which are developed independently of one another, will come to describe (aspects of) the 
same thing, but will do so by referring to that thing by different names. This situation is not accidental: it 
is a defining characteristic of the (Semantic) Web that there is no central naming authority that is able to 
enforce a Unique Name Assumption (UNA). Thanks to identity links, data sets that have been constructed 
independently of one another are still able to make use of each other’s information. As a consequence, 
identity link detection, i.e. the ability to determine — with a certain degree of confidence — that two different 
names in fact denote the same thing, is not a mere luxury but is essential for Linked Data to work. The 
most common property that is used for interlinking data on the Web is owl:sameAs. An RDF statement of 
the form “x owl:sameAs y” indicates that every property attributed to x must also be attributed to y, and 
vice versa. 


Over time, an increasing number of Semantic Web studies have shown that the identity predicate is used 
incorrectly for various reasons (e.g. heuristic entity resolution techniques, lack of suitable alternatives for 
owl:sameAs, context-independent classical semantics). This misuse has resulted in the presence of a number 
of incorrect owl:sameAs statements in the LOD Cloud, with some studies estimating this number to be 
around 2.8% [1] or 4% [2], whilst other studies suggest that possibly one out of five owl:sameAs in the 
Web is erroneous [3]. 


In order to limit the problem of identity links in the Semantic Web, a number of different attempts have 
emerged over the recent years (for a more exhaustive background, we refer the reader to [4]). In 2007, the 
authors of [5] proposed the centralised entity naming system OKKAM as a way to encourage the reuse of 
existing names. This centralised solution did not receive uptake in the wider Semantic Web community. 
This is not entirely unexpected, since the introduction of a centralised naming authority amounts to a strong 
departure from the original (Semantic) Web ideology, while at the same time requiring significant changes 
to the use and interchange of Linked Data in practice. Other approaches tried to limit the Semantic Web 
identity problem by providing centralised access to identity statements that are published in a decentralised 
way. Such centralised identity services allow Linked Data consumers to make an informed decision regarding 
the quality of identity statements they encounter. Although such services can in theory be useful for 
addressing the identity problem, current services suffer from semantic [6] and/or coverage [7] limitations. 
Firstly, the ‘identity bundles’ that the Consistent Reference Service [6] provides access to, are the result of 
the equivalence closure over owl:sameAs statements in addition to statements with different or no semantics. 
For example, umbel:isLike statements denote similarity instead of identity and are symmetric but not 
transitive; skos:exactMatch statements are symmetric and transitive, but indicate “a high degree of confidence 
that the concepts can be used interchangeably across a wide range of information retrieval applications,” 
which is semantically very different from the notion of identity. As a result, the semantics of the closure 
that is calculated over this variety of statements is unclear. On the other hand, LODsyndesis [7] is a 
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co-reference service based solely on owl:sameAs statements, but the number of triples it covers is an order 
of magnitude smaller than the ones we present in this work. 


Besides the above described services, various approaches have been proposed for detecting erroneous 
identity statements, based on the similarity of textual descriptions associated to a pair of linked names [8], 
Unique Name Assumption (UNA) violations [9, 10], logical inconsistencies [1, 11], network metrics [12], 
and crowdsourcing [13]. However, existing approaches either do not scale in order to be applied to the 
LOD Cloud as a whole, or they make assumptions about the data that may be valid in some data sets but 
not in others. For example, in the LOD Cloud not all names have textual descriptions, many data sets do 
not include vocabulary mappings, or they lack ontological axioms and assertions that are strong enough 
for deriving inconsistencies from. While all of the here mentioned approaches for erroneous identity links 
detection are useful in some cases, this paper presents a solution that can be applied to all data sets of the 
entire LOD Cloud. 


In our previous work [14], we presented an identity graph of over half a billion owl:sameAs statements, 
obtained from a large crawl of the LOD Cloud, which includes links between resources from thousands of 
namespaces. We have also presented the deductive closure of this identity graph, which consists of over 
35 billion implied identity statements. In other previous work [15], we presented a scalable approach for 
automatically assigning an error degree to each owl:sameAs statement in a given RDF graph, based on a 
community detection algorithm. In an extension of both these works, we show in this paper that even 
though the LOD Cloud contains erroneous owl:sameAs statements, it is still possible to use a very large 
subset of the LOD Cloud while at the same time minimizing the adverse effects of erroneous owl:sameAs 
statements. Specifically, we show that for the first time, Linked Data practitioners can now control in 
practice, the trade-off between (a) using more identity links, possibly not all correct, and benefiting from 
more contextual information from the LOD Cloud, and (b) using a smaller subset of correct identity links 
for limiting the risk of propagating erroneous identity links and information through the application of 
owl:sameAs semantics, i.e. transitive, symmetric, reflexive and property sharing. Since the choice of a 
certain subset must depend on a certain quality metric, we rely on the error degrees assigned in [15] as 
metric of identity links’ quality, and on the approach presented in [14] for deducting new -higher quality- 
identity sets in a computationally efficient manner. As a first experiment on how Linked Data practitioners 
can benefit from such a trade-off, we generate in this work two new identity subgraphs, in which we 
compute their transitive closure, and analyze their new resulted identity sets. In the first identity subgraph, 
we adopt a more conservative approach for dealing with the identity problem at hand by considering only 
links of higher quality (i.e. with low error degree). In the second identity subgraph, we aim at limiting the 
risk of incorrect identity assertions while at the same time minimizing the loss of reachable information. 
This is done by discarding a smaller subset, consisting solely of highly questionable links (i.e. with high 
error degree). Both these subgraphs with their resulted identity sets after transitive closure are publicly 
available®. 


© https://zenodo.org/record/3345674. 
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The rest of this paper is structured as follows. Section 2 presents our approach for calculating the 
deductive closure of large identity graphs. Section 3 presents our approach for automatically assigning error 
degrees to owl:sameAs statements. Implementation and empirical evaluation of the deductive closure of a 
large identity graph obtained from the LOD Cloud of the two approaches, are presented in Section 4. 
Section 5 shows experimentally how the two approaches can be combined for selecting new, higher quality, 
identity subgraphs from the overall identity graph based on specified error degrees. Section 6 presents our 
conclusions on the conducted work. 


2. EQUIVALENCE CLOSURE COMPUTATION 


This section presents our approach for calculating and storing the deductive closure of a large collection 
of owl:sameAs statements (see [14] for more details). The problem at hand can be defined as follows: let 
N denote the set of RDF terms (i.e. IRIs, literals, and blank nodes) that appear in the subject or object 
position of at least one, non-reflexive, owl:sameAs triple. A partitioning of N is a collection of non-empty 
and mutually disjoint subsets N, c N (called partition members) that together cover N. In a network solely 
composed of N with their corresponding owl:sameAs edges, the connected components of this network 
are called identity clusters, and its partition members are called identity sets. According to the owl:sameAs 
semantics, all RDF terms belonging to the same identity set denotes the same real world entity: Vx, y € Nk 
— x = y. In this work, we do not consider singleton identity sets, which are the result of terms that solely 
appear in reflexive owl:sameAs statements, and the result of terms which do not appear in any owl:sameAs 
statement. 


In order to calculate the closure, each identity set should be closed under equivalence, while taking in 
consideration multiple dimensions of complexity: 


The closure can be too large to store. We will later see that the LOD Cloud contains identity sets 
with cardinality well over 100K. It is not feasible to store the closure materialisation of each identity 
set, since the space consumption of that approach is quadratic in the size of the identity set (e.g. the 
closure of an identity set of 100K terms contains 10B identity statements). For this, we do not store 
the materialisation of the closure, but store the identity sets themselves, which is only linear in terms 
of the size of the universe of discourse (i.e. the set N of RDF terms). 


IN,| can be too large to store. Even the number of elements within one identity set can be too large 
to store in memory. Since our calculation of the closure must have a low hardware footprint and must 
be future proof, we do not assume that every identity set is always small enough to fit in memory. 


Data sets change over time. We calculate the identity closure for a large snapshot of the LOD Cloud. 
Since existing data sets in the LOD cloud are constantly changing, and new data sets are constantly 
added, our approach must support incremental updates of the closure. Specifically, our approach 
must allow for additions and deletions to be made over time, without requiring the entire closure to 
be recomputed. 
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Our approach is composed of the following steps: (1) extract the explicit owl:sameAs statements, (2) 
remove the owl:sameAs statements that are not necessary for calculating the deductive closure, and (3) 
compute the closure by partitioning the owl:sameAs network into several identity sets. 


2.1 Explicit Identity Network 


Given any RDF graph as input, e.g. the RDF merge of (a scrape of) the LOD Cloud, the first step of our 
approach is to extract the identity statements from this data graph (Definition 1). 


Definition 1 (Data Graph). A data graph is a directed and labelled graph G = (V, E, >, lẹ). V is the set 
of RDF terms and E is the set of node pairs or edges. >, is the set of edge labels. |, : E > 2%: is a function 
that assigns to each edge e E a set of labels belonging to ¥ (with le) representing the labels denoted 
to e). 


From a given data graph G, we can extract the explicit identity network Gex (Definition 2), which is a 
directed labelled graph that only includes those edges whose labels include owl:sameAs. 


Definition 2 (Explicit Identity Network). Given a graph G = (V, E, ¥ẹ lẹ, the corresponding explicit 
identity network Gex = (Vex, Ee) is the edge-induced subgraph Gl{e e E | {owl:sameAs} c le). Va is the 
set of RDF terms that appear in the subject and/or object position of at least one owl:sameAs statement 
(Vx GV). E is the set of node pairs or edges for which a statement (x, owl:sameAs, y) has been asserted 
in G (Ex c E). 


2.2 Identity Network: Construction 


Since identity is a reflexive, symmetric and transitive relation, the size of the explicit identity network 
can be significantly reduced prior to calculating the deductive closure. Specifically, we can reduce the size 
of Ge into a more concisely represented undirected and labelled identity network / (Definition 3), without 
losing any information. Since reflexive owl:sameAs statements are implied by the semantics of identity, 
there is no need to represent them explicitly. In addition, since the symmetric statements e; and e; make 
the same assertion: that v, and v; refer to the same real-world entity, we can represent this more efficiently, 
by including only one undirected edge with a weight of 2. A weight of 1 is assigned for edges which either 
ej Or e; are present in Vy but not both. 


Definition 3 (Identity Network). Given Ga = (Væ Ee), the identity network is an undirected labelled 
graph | = (V, E, {1, 2}, w), where V, is the set of RDF terms (V, € Va), and E, is the set of edges. {1, 2} are 
the edges labels, and w : E, > {1, 2} is the labelling function that assigns a weight w; to each edge e;. For 
an explicit identity network Gex = (Va, Eex), the corresponding identity network I is derived as follows: 


e £,:={e,€ Ex| i<j} 
e V,: = VE], ie. the vertex-induced subgraph 
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2, ife, €F,,ande, € Ex 
e wle;):= 


1q, ifnot 


2.3 Identity Sets: Compaction and Closure 


In this step, we can further reduce the input for the closure algorithm to a more concise set of ordered 
pairs. We call this preparation step compaction. Assuming a lexicographic order over RDF terms, we can 
reduce the input for the closure algorithm from an undirected labelled graph to a more concise set of pairs: 
{(x, y) | €y ^ x < y}. After compaction, we partition the terms V, into different identity sets. 


The partition of V, into different identity sets consists of a mapping? from nodes to identity sets (V, KH 
P(V,)). For space efficiency, we store each identity set only once by associating a key with each identity set: 
ID œ>, P(V); and map each RDF term to the key of the unique identity set that it belongs to val : V, >, ID. 
Hence, val(x) gives us the identity set ID of an RDF term x, and the composition key(val(x)) gives us the 
identity set of x. For partitioning V, into different identity sets, we use an incremental algorithm that parses 
each sorted identity pair (x, y), representing the output of the identity network compaction. The algorithm 
distinguishes between the following cases: 


Case 1. Neither x nor y occurs in any identity set. A new identity set id is generated and assigned 
to both x and y: x By, id; y Hy, id; and id œ, {x, y} 


Case 2. Only x already occurs in an identity set. In this case, the existing identity set of x is extended 
to contain y as well: y by, val(x); val(x) œ, key (val(x)) U {y} 


Case 3. Only y already occurs in an identity set. Similar to the previous case. 


Case 4. x and y already occur, but in different identity sets. In this case one of the two keys is 
chosen and assigned to represent the union of the two identity sets: 


val(x) Hy, key(val(x)) U key(vally)) 
(Vy' e key(val(y)))(y' Ry val(x)) 


This is the costliest step, especially when both identity sets are large, but it is also relatively rare due to 
the sorting that was applied as part of the compaction step. A further speedup is obtained by choosing to 
merge the smaller of the two sets into the larger one instead of the other way round. 


3. IDENTITY LINKS RANKING 


In the previous section we presented our approach for calculating and storing the deductive closure of 
a large identity graph. In this section we present a scalable approach for automatically assigning an error 
degree for each owl:sameAs statement (see [15] for more details). 


® Note that each term in V, does indeed belong to a unique non-singleton identity set. 
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Our approach consists of applying an existing community detection algorithm to each identity cluster, 
i.e. connected component of /. Based on the detected communities in each identity cluster, an error degree 
is calculated and assigned for each owl:sameAs link. The error degree that is assigned to an owl:sameAs 
link depends on two factors: (a) the density of the community in which the two linked RDF terms reside, 
or the density of the interlinks between the two communities in case the linked RDF terms are not in the 
same community; and (b) whether the identity link is reciprocally asserted or not (i.e. has a weight of 2). 
This error degree is subsequently used for ranking identity links, allowing potentially erroneous links to be 
identified and potentially true links to be validated. We believe community detection to be a particularly 
good fit for scalable identity error detection, since it only requires the identity network itself as input. 
Existing approaches require additional inputs that are not always available (e.g. resource descriptions, 
property mappings, vocabulary alignments) and/or that do not always scale to the LOD Cloud (e.g. 
crowdsourcing). Also, the use of community detection for identity error detection does not require additional 
assumptions that may hold for some data sets, but not hold for others. For instance, the UNA is known not 
to hold for data sets that are constructed over a longer period of time and/or by a larger group of contributors. 
In order to be able to apply our approach on a large scale, including application to (a large scrape of) the 
LOD Cloud, we identify the following additional requirements for our algorithm: 


Low memory footprint. Detection of erroneous identity links must not have a large memory footprint, 
since it must be able to scale to very large identity networks. In addition, we want our approach to 
be accessible to most researchers, by being able to run on a regular laptop. 


Parallel computation. It must be possible to perform computation in parallel. With the increase of the 
number of cores, even on regular laptop and consumer devices, this will allow identity link error 
degrees to be computed more efficiently. Preferably, error degrees can be calculated immediately 
after publication of an owl:sameAs link in the LOD Cloud. 


Updates. Calculation must be resilient against monotonic and non-monotonic updates. Since data are 
added to and removed from the LOD Cloud constantly, adding or removing an owl:sameAs statement 
must only require a re-ranking of the concerned links and must not require a complete recalculation. 


Our approach is composed of the following main steps: (1) detect the community structure within each 
identity cluster, and (2) assign an error degree to each owl:sameAs link in the identity cluster. By relying 
on the identity network constructed in Section 2.2, and the resulted identity sets in Section 2.3, constructing 
the identity clusters is a straightforward phase. 


3.1 Community Detection 


Despite the absence of a universally agreed upon definition, communities are typically thought of as 
groups that have dense connections among their members, but sparse connections with the rest of the 
network. Community detection is a form of data analysis that seeks to automatically determine the 
community structure of a complex network, requiring solely information that is already encoded in 
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the network topology. Detecting a network’s community structure is of great importance in many concrete 
applications and disciplines such as computer science, biology, and sociology, disciplines where 
systems are often represented as graphs. This has led to the emergence of several community detection 
algorithms, mostly making use of techniques from physics (e.g. spin model, optimization, random walks), 
as well as making use of computer science concepts and methods (e.g. non-linear dynamics, discrete 
mathematics) [16]. All such techniques aim at identifying group of nodes which are connected “more 
densely” to each other than to nodes in other groups. Hence, the differences between such methods 
ultimately come down to the precise definition of “more densely” and the algorithmic heuristic followed 
to identify such groups [17]. 


3.1.1 Community Detection Algorithms 


Having a large number of community detection techniques, we relied on existing surveys for choosing 
the best performing community detection algorithm for our task. In their 2009 survey, Lancichinetti and 
Fortunato [18] carried out a comparative analysis of the performances of 12 community detection algorithms, 
that exploit some of the most interesting ideas and techniques that have been developed over the last years. 
The tests were performed against a class of benchmark graphs, with heterogeneous distributions of degree 
and community size, including the GN benchmark [19], the LFR benchmark [20, 21], and some random 
graphs. This study concludes that the modularity-based method by Blondel et al. [22], the statistical 
inference-based method by Rosvall and Bergstrom [23], and the multi-resolution method by Ronhovde and 
Nussinov [24] all have an excellent performance, with the additional advantage of low computational 
complexity. In a more recent study, Yang et al. [25] compared the results of 8 state-of-the-art community 
detection algorithms in terms of accuracy and computing time. Interestingly, only half of these algorithms 
were considered in the previous survey, with the tests also being conducted on the LFR benchmark. This 
study concludes that by taking both accuracy and computing time into account, the modularity-based 
method by Blondel et al. [22] outperforms the other algorithms. Given that [22] outperforms the other 15 
algorithms in two different studies, with an additional advantage of low computational complexity, we 
deploy this algorithm for detecting the community structure in the owl:sameAs network. Next section 
presents an overview of this algorithm. 


3.1.2 Louvain Algorithm 
Given an identity cluster Q, the Louvain algorithm returns a set of non-overlapping communities C(Q,) 


= {C,, C, ..., C,} where: 


e a community C of size |C] (i.e. the number of nodes) is a subgraph of Q, such that the nodes of C 
are densely connected (i.e. the modularity of the Q, is maximized). 
© Uian G= Q and VC, C e CQ) s.t. i +j, GAG=@. 


The Louvain algorithm is a method for detecting communities in large networks. It is a greedy non- 
deterministic method, introduced for the general case of weighted graphs, for the purpose of optimizing 
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the modularity of the partitions. The modularity of a partition is a scalar value between —1 and 1 that 
measures the density of links inside communities as compared to links between communities. In the case 
of weighted networks, modularity is defined as follows: 


1 kik; 
Q = — Ai (C;,C;) (1) 


2m ‘7 


where: 

Aj represents the weight of the edge between the nodes i and j, 

k; and k; represent the sum of the weights of the edges attached to the nodes i and j, respectively, 
c; and ç; represent the community to which to the nodes i and j are assigned, respectively, 

2m = TDA, and representing the sum of all of the edge weights in the graph, 

6(u, v) is 1 if u = v and O otherwise. 


Modularity has been used to compare the quality of the partitions obtained by different methods, but 
also as an objective function to optimize [26]. Networks with high modularity have dense connections 
between the nodes within communities but sparse connections between nodes in different communities. 
This is the intuition of the Louvain algorithm. Firstly, it starts out by assigning a different community to each 
node of a given network. Then, it moves each node over to one of its neighbour communities, specifically, 
neighbours which result in the highest contribution to the modularity measure. In the next step, each 
community from the previous step is regarded as a single node, and the same procedure is repeated until 
the modularity (which is always computed with respect to the original graph) no longer increases. 


Although the exact computational complexity of the Louvain algorithm is not known, the method seems 
to run in time O(N log N), with N representing the number of nodes in the graph [22]. The exact modularity 
optimization is known to be NP-hard (non-deterministic polynomial-time hard), with most of the 
computational effort spent on the optimization at the first level. 


3.2 Error Degree Computation 


After detecting the community structure in each identity cluster, our approach assigns an error degree 
for each identity link based on its weight and the structure of the community(ies) in which its terms occur. 
We hypothesise that an identity link that is reciprocally asserted has a higher chance of being correct than 
a non-reciprocally asserted identity link. In addition, we hypothesise that not all resulting communities have 
the same quality. Based on these assumptions, we assign a lower error degree to owl:sameAs links that 
connect two terms that are within a densely connected community, or that connect two terms in two heavily 
interlinked communities. More precisely, to compute an error degree for each owl:sameAs link, we 
distinguish between two types of possible links: the intra- and inter-community links. 
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Definition 4 (Intra-Community Link). Given a community C, an intra-community link in C noted by 
ec is a weighted edge e; where v, and v; e C. We denote by Ec the set of intra-community links in C. 


Definition 5 (Inter-Community Link). Given two non overlapping communities C; and C, an 
inter-community link between C, and C, noted by eç, is an edge e; where v; € C; and v; e C. We denote 
by Ec, the set of inter-community links between C; and G.. 


For evaluating an intra-community link, we rely both on the density of the community containing the 
edge, and the weight of this edge. The lower the density of this community and the weight of an edge are, 
the higher the error degree will be. 


Definition 6 (Intra-Community Link Error Degree). Let eç be an intra-community link of the community 
C, the intra-community error degree of e. denoted by err(e,) is defined as follows: 


1 W. 
= 1 is 2 
oreo) = en) -| amc) j 


where Wc =}, „„ wleo). 


ec eee 

For evaluating an inter-community link, we rely both on the density of the inter-community connections, 
and the weight of this edge. The less the two communities are connected to each other and the lower the 
weight of an edge is, the higher the error degree will be. 


Definition 7 (Inter-Community Link Error Degree). Let eq, be an inter-community link of the 
communities C; and C, the inter-community error degree of Eq, denoted by err(e,,;) is defined as follows: 


1 We. 
= i 
err(eç ) = x] 1 (3) 
ij wlec) 2x | C, | x | C; | 


where Wg =} wlec ) 


e- €E, -ij e 
Cj Cj oi 


4. IMPLEMENTATION AND EXPERIMENTS 


In the previous two sections, we presented our approach for computing and storing the identity graph 
and its deductive closure, as well as our approach for assigning an error degree for each owl:sameAs link. 
In this section, we describe how these two approaches can be combined and used in practice to deal with 
the Semantic Web identity problem. The overall workflow of the identity network extraction, compaction 
and closure is displayed in Figure 1. 
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LOD-a-lot dataset Explicit Identity Network Identity Network Set of Sorted Pairs Identity Sets 
(LOD Cloud 2015) 558.9 million owl:sameAs 331million weighted edges 331 million pairs (non-singleton) 
28.3 billion triples 179.73 million terms 179.67 million terms 179.67 million terms 48.9 million sets 


a 
b b (a, c) IS, 
Extraction Construction 1 1 f Compaction Closure 
(b, c) 
O ° Ó > Cc» Cc» 


c 
2 (c, d) IS 


1 
g g f,g 
D > (f, g) 


d 


Figure 1. Workflow of the identity network extraction, compaction and closure. 


4.1 Data Graph 


We conduct our experiments on the LOD-a-lot data set [27], but any crawl of the LOD Cloud can be 
used as the starting point of our approach. We refer to this initial large-scale crawl of the LOD Cloud as 
our data graph (Definition 1). LOD-a-lot is the graph merge of the data obtained by the large-scale crawling 
and cleaning infrastructure of the LOD Laundromat [28]. The result of this merge is published in a single, 
publicly accessible Header Dictionary Triples (HDT) file that is 524 GB in size. This data graph contains 
over 28.3 billion unique triples, over 5 billion unique terms, that are related by over 1.1 billion unique 
predicates. 


4.2 Explicit Identity Network Extraction 


We use the LOD-a-lot HDT Dump to extract the explicit identity network (G,,), and the HDT C++ library® 
to stream the result set of the following SPARQL query to a file. This process takes around 27 minutes: 


select distinct ?s ?p 20 { 
bind (owl:sameAs AS 2p) 
2s 2p 20 } 


The results of this query are unique (keyword distinct) and the projection (2s ?p ?o) returns triples instead 
of pairs, so that regular RDF tools for storage and querying can be used. The explicit identity statements 
are stored in the order in which they are asserted by the original data publishers. This SPARQL query returns 
more than 558.9 million triples that connect 179.73 million terms. These owl:sameAs triples are written to 
an N-Triples file, which is subsequently converted to an HDT file. The HDT creation process takes almost 
four hours using a single CPU core. The resulting HDT file is 4.5 GB in size, plus an additional 2.2 GB for 
the index file that is automatically generated upon first use. This large collection of identity statements can 
be queried® directly from the browser or downloaded as HDT®. 


®  https://github.com/rdfhdt/hdt-cpp. 
®  https://krr.triply.cc/krr/sameas/graphs or http://sage.univ-nantes.fr/sparql/sameAs. 
® https://zenodo.org/record/1973099. 
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4.3 Identity Network Construction 


From the extracted G,,, we build the identity network (Definition 3) containing around 331M weighted 
edges and 179.67M terms. As a result, we leave out around 2.8M reflexive edges and around 225M 
duplicate symmetric edges. We also leave out 67,261 nodes that only appear in such removed edges. 


4.4 Identity Sets Compaction and Closure 


The identity network can be reduced to a set of sorted pairs prior to calculating the closure. For this we 
use GNU sort which takes less than 35 minutes on an SSD disk. Then, we compute the identity closure 
that consists of a map from nodes to identity sets. In order to build an efficient implementation of this key- 
value scheme, we need a solution that (i) uses almost no memory but is allowed to use a (SSD) disk, (ii) is 
able to store billions of key-value pairs, and (iii) allows such pairs to be added/removed dynamically over 
time. For this we use the RocksDB® persistent key-value store through a SWI Prolog API? that was designed 
for this purpose, allowing to simultaneously read from and write to the database. Since changes to the 
identity relation can be applied incrementally, the initial creation step only needs to be performed once. 
The computation of the closure takes just under 5 hours using 2 CPU cores on a regular laptop. The result 
is a 9.3GB on-disk RocksDB database: 2.7GB for mapping each term to an identity set ID (V, =œ, ID ), and 
6.6GB for mapping each identity set ID to its corresponding identity set (/D+>, P(V,) ). These files are 
publicly available®. 


The number of non-singleton identity sets is 48,999,148. The LOD-a-/ot file, from which we construct /, 
contains 5,093,948,017 unique terms. This means that there are 5,044,948,869 singleton identity sets in 
the deductive closure. Figure 2 shows that the size distribution of non-singleton identity sets is very uneven 
and fits a power law with exponent 3.3 + 0.04. Around 64% of non-singleton identity sets (31,337,556 
sets) contain only two terms. There are relatively few large identity sets, with the largest one containing 
177,794 terms. By looking at the IRIs of this identity set, we can observe that it contains a large number 
of terms denoting different countries, cities, things and persons (e.g. Bolivia, Dublin, Coca-Cola, Albert 
Einstein, the empty string, etc.). This observation shows the need of detecting the erroneous owl:sameAs 
links which led to such false equivalences after transitive closure. 


® https://rocksdb.org. 
®  https://github.com/JanWielemaker/rocksdb. 
® https://zenodo.org/record/3345674. 
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Figure 2. The distribution of identity set cardinality in G,,,. The x-axis lists all 48,999,148 non-singleton identity 
sets. 


4.5 Links Ranking 


Relying on the constructed identity network / and the resulting 48,999,148 identity sets in the deductive 
closure, we reconstruct the identity clusters of /. These identity clusters represent a partitioning of / into 
connected components. Then, we apply the Louvain algorithm for detecting the community structure in 
each identity cluster. As discussed in Section 3.1.2, the Louvain method is a greedy and non-deterministic 
algorithm. Meaning that in different runs, the algorithm might produce different communities, with no 
insurance that the global maximum of modularity will be attained. For this, we have run Louvain 10 times 
on each identity cluster, and finally considered the community structure with the highest modularity. After 
detecting the communities, we assign an error degree to all edges of each identity cluster. 


4.5.1 Quantitative Evaluation 


This process of community detection and error degree computation took 80 minutes, resulting an error 
degree to each owl:sameAs statement in the identity network (around 556M statements after discarding the 
reflexive ones). The error degree distribution of these statements is presented in Figure 3. This figure shows 
that around 73% of the statements have an error degree below 0.4, whilst around 5% of the owl:sameAs 
statements have an error degree higher than 0.8. Whilst this distribution is mainly caused by the high 
number of symmetrical identity statements in the LOD Cloud, it also indicates that most identity clusters 
have a rather dense structure. The 179.67M terms of the identity network were assigned into a total of 
55.6M communities, with the communities’ size varying between 2 and 4,934 terms (averaging 3 terms 
per community). The JAVA implementation® of the ranking process, and the computed error degrees® for 
all owl:sameAs statements are publicly available. 


® http://github.com/raadjoe/LOD-Community-Detection. 
® https://sameas.cc. 
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Figure 3. Error degree distribution of all owl:sameAs statements in the LOD-a-lot. 


4.5.2 Qualitative Evaluation 


In order to evaluate the accuracy of our ranking approach, we conducted several manual evaluations. 
This evaluation is an extension of the ones conducted in [15], in terms of the number of manually evaluated 
links and the followed analyses. 


In this evaluation, we aim at defining a threshold x of the error degree, in which owl:sameAs links that 
have an error degree <x have high probability of correctness, and links which have an error degree >x have 
high probability of being erroneous. In order to determine this threshold, four of the paper’s authors were 
asked to evaluate a number of owl:sameAs links. The judges relied on the descriptions® associated to the 
terms in the LOD-a-/ot data set [27], and did not have any prior knowledge about each link’s error degree 
(i.e. whether they are evaluating a well-ranked link or not). In order to avoid any incoherence between the 
evaluations, the judges were asked to justify all their evaluations, and were given the following instructions: 
(a) the same: if two terms denote the same entity (e.g. Obama and the First Black US President), (b) 
related: not intended to refer to the same entity but closely related (e.g. Obama and the Obama 
Administration, or Obama and the Wikipedia article of Obama), (c) unrelated: not the same nor closely 
related (e.g. Obama and the Indian Ocean), (d) can’t tell: in case there are no sufficient descriptions 
available for determining the meaning of both terms (i.e. non-dereferenced IRIs and appearing solely as 
subjects or objects of owl:sameAs statements in the LOD Cloud). Based on the judges’ evaluations, we 
deploy the following terms: 


True Positives (TP) referring to owl:sameAs links which have an error degree >x and were evaluated 
by the judges as incorrect links (related or unrelated). 


® Judges were asked to not consider the owl:sameAs statements associated to a term. 
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False Positives (FP) referring to owl:sameAs links which have an error degree >x and were evaluated 
by the judges as true identity links. 


True Negatives (TN) referring to owl:sameAs links which have an error degree <x and were evaluated 
by the judges as true identity links. 


False Negatives (FN) referring to owl:sameAs links which have an error degree <x and were evaluated 
by the judges as incorrect links (related or unrelated). 


An approach of erroneous link detection can be evaluated using the classic evaluation measures of 
precision, recall and accuracy defined as follows: 


is recall = LP 


TP+ FP TP+ FN 
TP+TN 


TP +FP+TN+FN 


precision = 


accuracy = 


We consider that when a Semantic Web expert is not able to confirm the correctness of a certain identity 
link due to the absence of necessary descriptions for one of the two involved terms, an automated approach 
will have the same limitations. Following this assumption, we will not consider links judged by the experts 
as “can’t tell” in the accuracy, precision, and recall evaluations. 


Firstly the judges were asked to evaluate 200 owl:sameAs statements (50 each), representing a sample 
of each bin of the error degree distribution shown in Figure 3. The main goal of this first evaluation of a 
small set of links, is to test whether the defined error degree is indeed an indicator for correctness (i.e. 
whether the probability of finding erroneous identity links increases with the increase of the error degree). 
From the results presented in Table 1, we can observe the following: 


Table 1. Evaluation of 200 owl:sameAs links, with each 40 links randomly chosen from a certain range of error 
degree. The percentages between parentheses are calculated without considering the links evaluated as “can’t tell”. 


Error degree range 0-0.2 0.2-0.4 0.4-0.6 0.6-0.8 0.8-1 Total 
same 35 (100%) 22 (100%) 18 (85.7%) 7 (77.8%) 15 (68.2%) 97 (89%) 
related + unrelated 0 (0%) 0 (0%) 3 (14.3%) 2 (22.2%) 7 (31.8%) 12 (11%) 
related (0) 0 2 2 2 6 
unrelated 0 0 1 0 5 6 
can’t tell 5 18 19 31 18 91 
Total 40 40 40 40 40 200 


e The higher an error degree is, the more likely that the link is erroneous. 
e 100% of the evaluated links with an error degree < 0.4 are correct. 
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e When the error degree is between 0.4 and 0.8, 83.3% of the owl:sameAs links are correct, whilst in 
13.3% of the cases, such links might have been used to refer to two different, but related terms. 

e An owl:sameAs with an error degree > 0.8 is a less reliable identity statement, referring in 31.8% of 
the cases to two different terms. 


By taking the highest available threshold in this table (x = 0.8), the evaluation suggests a precision of 
31.8% ( 354+22+18+7+7 
. o 


vi 
15+7 35+22+21+9+ 22 
required in order to detect erroneous owl:sameAs statements with higher precision. Therefore, we increased 


), and an accuracy of 81.6% ( ). This suggests that a higher threshold is 


the threshold to 0.99, and manually evaluated additional samples in order to better evaluate the accuracy 
and the precision of our approach. The 200 owl:sameAs statements evaluated in Table 1 represents our first 
sample. We briefly describe in the following the three additional samples, and their purposes: 


Sample 2. It represents a set of 60 owl:sameAs statements, with error degrees between 0.9 and 1, 
randomly chosen from identity clusters of various sizes. We mention that 21 of these owl:sameAs 
were judged as “can’t tell” by the judges. 


Sample 3. In 2013, the authors of [13] manually evaluated 95 owl:sameAs statements linking Freebase 
with DBpedia terms, and they were all judged as correct. It is the only publicly available external 
gold standard from the LOD Cloud. Out of these 95 links, only 78 were found in our data graph. 
Verifying their error degrees, only one link was assigned an error degree higher than 0.99 (FP), with 
the 77 other links having an error degree ranging from 0.52 to 0.94 (TN). 


Sample 4. In order to evaluate the recall of our approach, we have verified how our approach can rank 
newly introduced erroneous owl:sameAs statements. Firstly, we have chosen 40 random terms from 
the identity network, making sure that all these terms are different and not explicitly owl:sameAs® 
(e.g. dbr:Paris, dbr:Strawberry, dbr:Facebook). From the 40 selected terms, we have generated all the 
possible 780 undirected edges between them. We added separately, each edge e; to the identity 
network with w(e,) = 1, calculated its error degree, and removed it from the identity network before 
adding the next one. The resulted error degrees of the newly introduced erroneous identity links range 
from 0.87 to 1. When the threshold is fixed at 0.99, the recall of detecting erroneous identity links 
is 93%, with 725 (TP) out of the 780 added links (TP+FN) having an error degree > 0.99. 


The evaluation of these owl:sameAs samples is presented in Table 2, with a threshold of 0.99. The results 
presented in this table suggests that our approach can classify an identity link (as correct or erroneous) with 
a 91.9% accuracy when the threshold is set at 0.99. We are aware that accuracy is sometimes misleading 
in imbalanced data sets, as it fails to detect an approach which is biased towards the dominated class of 
the data set. However, in the case of the four samples we considered, we believe that accuracy can be a 


® But be sure to include some terms that currently belong to the same identity set. 
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good estimation on how this automatic approach matches human judgement, since it performs well on 
samples dominated by correct owl:sameAs (e.g. sample 3), and samples dominated by erroneous links (e.g. 
sample 4). 


Table 2. Accuracy of the approach on the four manually evaluated samples, based on a threshold of 0.99. Links 
evaluated as “can’t tell” by the judges are discarded. 


Sample TN TP FN FP Total Accuracy 
Sample 1 97 0 12 0 109 88.9% 
Sample 2 6 20 5 8 39 66.6% 
Sample 3 77 0 0 1 78 98.7% 
Sample 4 0 725 55 0 780 92.9% 
Total 180 745 72 9 1006 91.9% 


5. FIXING OWL:SAMEAS IN THE LOD CLOUD 


This section combines our approach for computing the transitive closure of owl:sameAs (Section 2), with 
our approach that assesses the quality of owl:sameAs (Section 3). Specifically, we use the error degrees 
computed in Section 4.5 for generating new, higher quality, identity networks and sets in the LOD Cloud. 
This section makes the following contributions: 


e Presents two new identity networks, with the first identity network discarding “potentially erroneous” 
owl:sameAs, and the second identity network containing only “high quality” owl:sameAs. 

e Presents the transitive closure of the two newly constructed identity networks. 

e Presents a use case and a benchmark for evaluating the advantages and the limitations of using these 
new identity networks and their deducted identity sets. 


5.1 Link Selection Criteria 


In order to classify an owl:sameAs as “potentially erroneous” or of “high quality”, we rely on the 
manually evaluated links previously described in Table 1 and Table 2. Specifically, Table 1 shows that 100% 
of the manually evaluated links with an error degree < 0.4 are correct (57/57). Hence, suggesting that links 
with such error degrees can be classified as “probably correct”. In addition, the manual evaluation presented 
in Table 2, shows that our approach has high accuracy when the threshold is fixed at 0.99 (~92% accuracy). 
Hence, it also suggests that identity links with an error degree > 0.99 can be classified, with high degree 
of confidence, as “potentially erroneous”. In order to put these two hypotheses to the test, we construct 
two new subgraphs of the identity network with the following characteristics: 


1. All owl:sameAs statements in the identity network with an error degree > 0.99 are discarded. 
2. All owl:sameAs statements in the identity network with an error degree > 0.4 are discarded. 
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In the following, we present a quantitative and qualitative evaluation of these two identity subgraphs. In 
addition, we generate and evaluate their corresponding identity sets, and discuss their impact. These new 
subgpraphs, with their deducted identity sets, are publicly available®. 


5.2 Quantitative Evaluation 


Figure 4 presents a comparison in terms of the number of identity pairs, terms and resulting identity sets 
between the closure of (a) the original identity network containing all 331M identical pairs® in the LOD 
Cloud; (b) a subset of the identity network containing only links with an error degree < 0.99; and (c) a 
subset of the identity network containing only links with an error degree < 0.4. Firstly, this figure shows 
that the number of RDF terms that belong to non-singleton identity sets decreases from 179.67M in the 
original closure to 179.34M in the closure. This indicates that about 330K RDF terms are solely involved 
in “potentially erroneous” owl:sameAs statements. In addition, this evaluation shows that discarding the 
about 500K pairs with an error degree > 0.99 results in about 75K additional non-singleton identity sets. 
These sets are the result of the partitioning of several large probably incorrect- identity sets. On the other 
hand, considering only the 205K “highly probable” identical pairs in the LOD Cloud (i.e. linked by an 
owl:sameAs with error degree < 0.4) decrease the number of the resulting non-singleton identity sets by 
more than 70% (from 48.9M non-singleton identity sets in the original closure to 15.9M). This result is 
expected since only 53M out of the 179M terms in the LOD Cloud (about 30%) are involved in identity 
statements with such high confidence. 
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Figure 4. Comparison of the original identity network and its transitive closure, with the two newly constructed 
identity subgraphs. 


© https://zenodo.org/record/3345674. 
® Note that in the identity network, reflexive and duplicate symmetric links are discarded. 
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In order to analyze closely the impact of these new computed closures, we focus firstly on the largest 
identity set. This identity set includes originally 177,794 terms, connected by 2.8M edges that are the result 
of the compaction of 5.5M owl:sameAs (about 1% of the total number of owl:sameAs ). This identity set 
includes terms referring to a large number of different real-world entities, falsefully claimed to be the same 
(in an explicit or implicit manner). When owl:sameAs links, with error degree >0.99 are discarded 
(respectively discarding links with error degree >0.4), the terms of this identity set are partitioned into 1,086 
non-singleton identity sets (resp. 640 sets), with an average of 160 terms per set (resp. 50 terms). The largest 
of these derived non-singleton identity sets contains 3,686 terms (resp. 157 terms) and the smallest set in 
each of the two closures contains only two terms. As a result of discarding links with an error degree >0.99, 
we find that 4,146 terms, originally belonging to the largest identity set (2.3% of the terms), now appear 
in singleton identity sets. On the other hand, discarding links with an error degree >0.4, results in 145,934 
terms originally belonging in the largest identity set (82% of the terms) to be present in singleton identity 
sets (i.e. not identical to any other term anymore). 


Finally, in terms of computational efficiency, the two new deductive closures are much faster to compute. 
Compared to the original closure that takes about 5 hours using two CPU cores on a regular laptop, the 
closure (b) and (c) takes about 2.5 hours and about 1 hour, respectively. This significant difference in 
computation time between the closure (a) and (b), despite the relatively small difference in the number of 
included pairs (about 500K pairs), is consistent with our observation in Section 2.3, stating that the 
computation time for calculating the deductive closure is dominated by the time that is needed to merge 
large identity sets. 


5.3 Qualitative Evaluation 


Until now, we computed two additional equivalence closures, based on discarding owl:sameAs with 
certain error degrees. We also compared these new subgraphs to the original identity graph in terms of the 
number of resulting identity sets, number of terms included in non-singleton identity sets, and the 
computation time for deducting their identity sets. In this section, we analyze the quality of these subraphs, 
and compare it to the quality of the original identity graph (i.e. all the owl:sameAs links of the LOD Cloud). 
For conducting this qualitative analysis, we look closely at the specific use case of the identity cluster 
containing the DBpedia term of the former US president Barack Obama (dbr:Barack_Obama). This identity 
cluster, presented in Figure 5, includes 440 terms connected by 7,615 edges. These edges are the result of 
the compaction of 14,917 owl:sameAs statements. Applying the Louvain algorithm on this identity cluster 
results in four non-overlapping communities®, presented in Figure 6. 


® https:/Awww.sameas.cc/explicit/obama-class.svg. 
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Figure 5. “Barack Obama” identity cluster. 


Figure 6. Community structure of the “Barack Obama” identity cluster. 


In order to analyze the impact of discarding links with certain error degrees, the authors were asked to 
manually annotate each RDF term of this identity set. By dereferencing the IRIs, and looking at their 
descriptions in the LOD-a-/ot data set, the authors were able to annotate 338 out of the 440 RDF terms. 
These 338 terms were judged to refer to 8 different real world entities (e.g. Barack Obama the person, his 
presidency, his senate career). The other 102 RDF terms do not have enough descriptions to be confidently 
annotated by the judges, and will be discarded for the rest of this evaluation. Now that all the terms in this 
identity set are annotated, we conduct two complementary evaluations. The first evaluation conducted on 
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the links’ level (Section 5.3.1), allows us to evaluate how accurately our approach can classify each identity 
link, based on different thresholds. The second evaluation conducted on the closure level (Section 5.3.2), 
allows us to evaluate the impact of discarding few identity links on the quality of the overall closure. Both 
the code and the data necessary for replicating these analyses are publicly available®. 


5.3.1 “Barack Obama” Identity Cluster — Links Evaluation 


The identity cluster containing the DBpedia term of the former US president “Barack Obama” contains 
14,917 owl:sameAs, between 440 RDF terms. After discarding the 102 terms that could not be manually 
annotated and their corresponding links®, this identity cluster now contains 14,613 owl:sameAs linking 
338 RDF terms. Based on our manual annotation of these terms, we find that there are 26 owl:sameAs that 
link two RDF terms referring to different entities (0.17% of the links are erroneous). This evaluation shows 
that even though this identity cluster mostly contains correct owl:sameAs links, the presence of a small 
number of erroneous ones have led to the false equivalence of RDF terms referring to 8 different real world 
entities. Table 3 presents the evaluation of our approach in classifying the owl:sameAs links of this identity 
cluster according to both considered thresholds 0.4 and 0.99. From this table, we can observe that there 
are only two owl:sameAs links with an error degree > 0.99, with both these links being indeed erroneous 
(TP=2 and FP=0 in closure b). Hence, these results suggest a 100% precision (2/2), and a 7.69% recall 
(2/26) for our approach when the threshold is fixed at 0.99. Conversely, when the threshold is set at 0.4, 
the recall of our approach increases to 100% (i.e. all 26 erroneous links in this identity cluster have an 
error degree >0.4), with the precision decreasing to 3.16% (due to the additional 795 flagged links that are 
actually correct). For both thresholds, the accuracy is considerably high, which is expected in the case of 
an imbalanced data set. 


Table 3. Precision, recall and accuracy, based on two thresholds (0.99 and 0.4) for the Barack Obama identity 
set. Links evaluated as “can’t tell” by the judges are discarded. 


Closure Error degree threshold TN TP FN FP Total Recall Precision Accuracy 
(b) 0.99 14,587 2 24 0 14,613 7.69% 100% 99.83 % 
(c) 0.4 13,792 26 O 795 14,613 100% 3.16% 94.55% 


5.3.2 “Barack Obama” Identity Cluster — Closure Evaluation 


In the previous evaluation, we showed that the identity cluster of “Barack Obama” in the LOD Cloud 
contains 26 erroneous owl:sameAs that led to the false equivalence of RDF terms referring to 8 different 
real world entities. The evaluation in Table 3 shows that depending on the chosen threshold, a Linked Data 
practitioner can decide whether to remove only few erroneous owl:sameAs without removing any correct 
ones (i.e. closure b), or remove all erroneous owl:sameAs but discard at the same time a large number of 


® https://github.com/raadjoe/obama-lod-identity-analysis. 
® Note that these terms were not discarded during the computation of the error degree, but only discarded in the evaluation. 
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correct ones (i.e. closure c). In this section, we evaluate the impact of both approaches on the overall 
closure, and compare it to the original closure of all owl:sameAs links in the LOD-a-/ot data set. Table 4 
presents the differences in the resulted identity sets between the three closures, where we can observe the 
following: 


Gold Standard. Our manual annotation shows that the 338 RDF terms in the original identity cluster 
refer to 8 different real world entities. Ideally we expect that these terms to be partitioned into 8 
different identity sets {GS}, ..., GSs}, with GS, containing the 260 RDF terms referring to the person 
Barack Obama, GS, containing the 47 terms referring to his presidency, and so forth. 


Closure (a). The original closure of all owl:sameAs links in the LOD-a-lot data set results in all these 
338 RDF terms to belong in one identity set A4. These 338 terms are explicitly connected by 14,613 
owl:sameAs links, and result in 56,953 distinct pairs. 


Closure (b). After discarding the two owl:sameAs links with an error degree >0.99, the closure of the 
14,611 remaining owl:sameAs links partitions these 338 RDF terms into two non-singleton identity 
sets B, and B,. The former contains 270 RDF terms and the latter contains the remaining 68 terms. 
Table 4 shows that by solely discarding two erroneous owl:sameAs links from this identity cluster, 
our approach was able to separate all terms referring to Obama’s presidency and presidential transition® 
from the rest of the terms. However, despite this significant quality improvement, both identity sets 
are still logically inconsistent (i.e. contain terms that refer to different real world entities). 


Closure (c). After discarding the 821 owl:sameAs links with error degree >0.4, the closure of the 
remaining 13,792 correct identity links partitions the 338 RDF terms into 219 identity sets. Out of 
these identity sets, only the identity set C, is non-singleton, containing 120 RDF terms that refer to 
the person Barack Obama. Although this closure results in a large number of identity sets (compared 
to the expected 8 sets), the resulting closure is now semantically correct. 


® terms referring to the period in which Obama won the US election in November 2008 until his inauguration in January 
2009 when his presidency started. 
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Table 4. Comparison of the original identity network closure, the closure (b) and (c), with the Gold Standard. 


# Distinct Pairs 


Closure (a) Closure (b) Closure (c) 

Real World Entity A B, B, C, C, C19 
GS, | Barack Obama 260 260 0 120 1 0 
GS, | Obama’s Presidency 47 0 47 0 0 0 
GS, | Obama’s Presidential Transition 22 1 21 (0) (0) 0 
GS, | Obama’s Senate Career 5 5 (0) 0 0 0 
GS; | Obama’ Presidential Centre 1 1 0 0 0 0 
GS, | Obama’s Biography 1 1 0 0 0 0 
GS, Obama’s Photos 1 1 0 0 0 0 
GS, Black President 1 1 0 0 0 1 

# Terms in Identity Set 338 270 68 120 1 1 1 

# Identity Sets 1 2 219 

# Explicit Identity Links 14,613 14,611 13,792 


In order to further evaluate the quality of these three closures, we also rely on the classic evaluation 


measures of precision, recall and accuracy. However, since the evaluation in this case is conducted on the 


level of the resulting identity clusters, we re-define the notion of True/False Positives/Negatives for evaluating 


our approach in terms of the number of remaining correct/erroneous distinct pairs (instead of explicit 


owl:sameAs). Hence, all possible 56,953 distinct pairs X between the 338 manually annotated RDF terms 


represent the universe of discourse in the following evaluation. 


True Positives (TP). Pairs (a,b) € X, in which a and b are in the same identity cluster in both the gold 


standard and in the resulting closure. 


False Positives (FP). Pairs (a,b) € X, in which a and b are in the same identity cluster in the resulting 


closure but in different clusters in the gold standard. 


True Negatives (TN). Pairs (a,b) € X, in which a and b are in different clusters in both the gold standard 


and the resulting closure. 


False Negatives (FN). Pairs (a,b) € X, in which a and b are in the same identity cluster in the gold 


standard, but in different clusters in the closure. 


Table 5 presents the recall, precision and accuracy evaluation of each closure. For the closure (a), it is 


expected to have a 100% recall since all the 338 considered terms in this evaluation belong originally to 
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the same identity cluster. Since ideally, these terms should be partitioned into 8 different clusters, this table 
shows that the original closure of all owl:sameAs links contains 21,961 erroneous pairs, resulting in a 61% 
precision of the original closure. On the other hand, for the closure (c) it is also expected to have a 100% 
precision since all pairs (a,b) that belong to the same resulting identity cluster are indeed identical (TP). 
This result can also be observed in Table 4, where all identity sets resulting from the closure (c) are either 
singleton, either in the case of C, refer solely to the person Obama. However, since it partitions the 338 
RDF terms into 219 identity clusters (where the ideal partition is 8 clusters), the recall of this closure is 
significantly lower (20%), due to the large number of correct pairs that have been separated in this closure 
(FN). Finally, the closure (b) manages to have both a high recall (99.9%) and high precision (90.6%). This 
is due to the large number of non-identical pairs that have been separated (TN), and the large number of 
identical pairs that were kept in the same cluster (TP) after discarding the only two owl:sameAs with an 
error degree >0.99. For instance, we can see from Table 4 that all 260 terms referring to the person Obama 
are still in the same cluster B,, and at the same time separated from all 47 terms referring to his presidency. 


Table 5. Precision, recall and accuracy evaluation of the three closures. 


Closure TN TP FN FP Total Recall Precision Accuracy 
(a) 0 34,992 0 21,961 56,953 100% 61.4 % 61.4 % 
(b) 18,339 34,971 21 3,622 56,953 99.9% 90.6 % 93.6 % 
(c) 21,961 7,140 27,852 0 56,953 20 % 100 % 51% 


6. CONCLUSION 


In this paper we showed that even though a global knowledge graph like the LOD Cloud may contain 
incorrect identity statements, it is still possible to use subgraphs of this knowledge graph in a productive 
way. This is done by extracting all owl:sameAs links, computing their transitive closure, and assigning an 
error degree to each link based on the identity clusters community structure. Since the here presented 
approach only takes a couple of hours to be computed over a large scrape of the LOD Cloud (over 500 
million explicit and 35 billion implicit owl:sameAs statements) while using a regular laptop, this trade-off 
between identity subgraph size and link validity can actually be used in practice. In fact, our qualitative 
evaluation on a benchmark of about 1K owl:sameAs links, and a manually annotated identity cluster 
containing about 14K owl:sameAs links suggests that discarding a small number of links, with an error 
degree higher than 0.99 (about 0.17% of the links), can significantly enhance the quality of the resulting 
closure (correctness of the closure increases from 61% to 93%). Our results also suggest that discarding a 
larger number of links, with an error degree higher than 0.4 (about 27% of the links) can lead to semantically 
valid identity clusters (i.e. which do not contain two terms referring to different real world entities). These 
results could be further improved by combining or comparing results from multiple community detection 
methods. 
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Finally, based on the level of correctness that is required within a given application, a data practitioner 
may choose a different error degree, resulting in a smaller/larger subgraph depending on whether erroneous 
identity statements are considered more/less acceptable. In addition, some applications might choose to 
requalify links with certain error degrees to a less strict notion of identity, such as the one encoded in 
skos:closeMatch. This will allow practitioners to benefit from the presence of these asserted links, while 
reducing the semantic inconsistencies when reasoning is applied. We believe that automated and scalable 
data selection approaches, like the one presented in this paper, will see uptake over time, as they allow 
data practitioners to extract maximal value from the LOD Cloud, and thereby make the Semantic Web 
succeed as an integrated and reliable knowledge space. 
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